DegExt: a language-independent keyphrase extractor
نویسندگان
چکیده
In this paper, we introduce DegExt, a graph-based languageindependent keyphrase extractor,which extends the keyword extraction method described in (Litvak & Last, 2008). We compare DegExt with two state-of-the-art approaches to keyphrase extraction: GenEx (Turney, 2000) and TextRank (Mihalcea & Tarau, 2004). We evaluated DegExt on collections of benchmark summaries in two different languages: English and Hebrew. Our experiments on the English corpus show that DegExt significantly outperforms TextRank and GenEx in terms of precision and area under curve (AUC) for summaries of 15 keyphrases or more at the expense of a mostly non-significant decrease in recall and F-measure, when the extracted phrases are matched against gold standard collection. Due to DegExt’s tendency to extract bigger phrases than GenEx and TextRank, when the single extracted words are considered, DegExt outperforms them both in terms of recall and F-measure. In the Hebrew corpus, DegExt performs the same as TextRank disregarding the number of keyphrases. An additional experiment shows that DegExt applied to the TextRank representation graphs outperforms the other systems in the text classification task. For documents in both languages, DegExt surpasses both GenEx and TextRank in terms of implementation simplicity and computational complexity.
منابع مشابه
DegExt - A Language-Independent Graph-Based Keyphrase Extractor
In this paper, we introduce DegExt, a graph-based languageindependent keyphrase extractor,which extends the keyword extraction method described in [6]. We compare DegExt with two state-of-the-art approaches to keyphrase extraction: GenEx [11] and TextRank [8]. Our experiments on a collection of benchmark summaries show that DegExt outperforms TextRank and GenEx in terms of precision and area un...
متن کاملNoun Compound and Named Entity Recognition and their Usability in Keyphrase Extraction
We investigate how the automatic identification of noun compounds and named entities can contribute to keyphrase extraction and we also show how previously identified noun compounds affect named entity recognition and vice versa, how noun compound detection is supported by identified named entities. Our experiments demonstrate that already known noun compounds yield better performance in named ...
متن کاملAdaptation of a Keyphrase Extractor for Japanese Text*
This paper presents some statistical observations relevant to Japanese keyphrase extraction, as well as the details of the implementation of a keyphrase extraction algorithm (called Extractor) for Japanese documents. Parts of the algorithm include an efficient method of extracting the keyphrase candidates, a way to pinpoint the most probable keyphrases using contextual information, a technique ...
متن کاملUsing Noun Phrase Heads to Extract Document Keyphrases
Automatically extracting keyphrases from documents is a task with many applications in information retrieval and natural language processing. Document retrieval can be biased towards documents containing relevant keyphrases; documents can be classified or categorized based on their keyphrases; automatic text summarization may extract sentences with high keyphrase scores. This paper describes a ...
متن کاملSupervised Keyphrase Extraction as Positive Unlabeled Learning
The problem of noisy and unbalanced training data for supervised keyphrase extraction results from the subjectivity of keyphrase assignment, which we quantify by crowdsourcing keyphrases for news and fashion magazine articles with many annotators per document. We show that annotators exhibit substantial disagreement, meaning that single annotator data could lead to very different training sets ...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید
ثبت ناماگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید
ورودعنوان ژورنال:
- J. Ambient Intelligence and Humanized Computing
دوره 4 شماره
صفحات -
تاریخ انتشار 2013